ABSTRACT

The development of numerical methods and soft- 
ware tools for parallel processors can be aided 
through the use of a hardware test-bed. The test-bed 
architecture must be flexible enough to support inves- 
tigations into architecture-algorithm interactions. 

One way to implement a test-bed is to use a commercial 
parallel processor. Unfortunately, most commercial 
parallel processors are fixed in their interconnection 
and/or processor architecture. In this paper, we 
describe a modified n-cube architecture, called the 
hypercluster, which is a superset of many other pro- 
cessor and interconnection architectures. The hyper- 
cluster is intended to support research into parallel 
processing of computational fluid and structural 
mechanics problems which may require a number of dif- 
ferent architectural configurations. An example of 
how a typical partial differential equation solution 
algorithm maps on to the hypercluster is given. 

INTRODUCTION 

Two research areas which are critical to the 
future progress of aerospace technology are computa- 
tional fluid mechanics (CFM) and computational struc- 
tural mechanics (CSM). The practical limits of 
applications in both of these areas are set by the 
state-of-the-art in computer architecture and soft- 
ware techniques. Parallel processing is an architec- 
tural concept which has the potential for vastly 
improving the performance of future computer systems. 
However, the use of parallel processing architectures 
will require a reassessment of numerical methods and
software techniques that are currently used for
CFM/CSM. Likewise, CFM/CSM requirements may impact future
parallel architectures. 

Most CFM/CSM problems require the numerical solution
of a system of nonlinear partial differential
equations (PDE). There are many algorithms for solving
systems of PDE's on computers. The ideal algorithm
for a given application minimizes computation
time and the amount of memory required. A consider- 
able amount of research has been done in this area for 
uniprocessor computers, resulting in many accepted 
approaches for solving various PDE systems. The con- 
tinuing demand for more computing power and the emer- 
gence of supercomputer architectures employing 
parallel processors has prompted research into new 
approaches to solving systems of PDE's (Ortega and 
Voigt 1985). The goal of that research is the devel- 
opment of higher performance CFM/CSM codes that can 
effectively utilize the new parallel architectures.

The development of algorithms for parallel pro- 
cessors is not a straightforward task. Algorithms for 
parallel processors must be able to be partitioned 
into independent tasks that can be allocated to multiple
processors for simultaneous execution. A high
degree of parallelism does not guarantee higher per- 
formance, however. The development of parallel algo- 
rithms can be complicated by the hardware aspects of 
parallel processors. The communication mechanism 
between processors is one example. The algorithm 
should be analyzed to determine if fast, tightly 
coupled communication between processors is required, 
or if a slower, loosely coupled mechanism will 
suffice. 

The individual processor architecture can also 
impact the performance of an algorithm. For example, 
a vector processor architecture operates most effec- 
tively by performing a single mathematical operation 
on large arrays of data. The performance of a vector 
processor is dependent on the length of the data 
arrays, or vectors. Therefore, it is desirable to 
develop algorithms which make use of long vector oper- 
ations. If parallel vector processors are used, then 
any partitioning of the numerical method should avoid 
shortening the vector length to the point of degrad- 
ing performance. 

The memory hierarchy employed in a parallel pro- 
cessor is another consideration in the development of 
parallel algorithms. The use of local processor 
memory and/or global shared memory are examples of 
memory hierarchy within a parallel architecture. 

Cache memory, interleaved memory and mass storage are 
levels of the memory hierarchy local to the processors 
in a parallel processing system. An efficient paral- 
lel algorithm must make optimum use of the existing 
memory hierarchy. This requires maximizing the amount 
of computation occurring in the lowest (i.e., fastest) 
level of the hierarchy. 

To summarize, the development of parallel algorithms
requires cognizance of a large number of hardware
and architectural parameters. This makes the
evaluation of algorithm performance a critical step
in the development process. To some extent, this can 
be done analytically. A detailed analytical perform- 
ance evaluation would be cumbersome, however, espe- 
cially if the number of hardware and architectural 
parameters is high. A preferable approach is to
evaluate algorithms on a hardware test-bed, where
hardware and architectural parameters can be directly
implemented or efficiently emulated.

A research effort at the NASA Lewis Research 
Center is devoted to studying the application of par- 
allel processing to CFM/CSM. This effort is an out- 
growth of work previously done on the Real-Time 
Multiprocessor Simulator (RTMPS) project (Arpasi 1985; 
Blech and Arpasi 1985; Cole 1985; Arpasi and Milner 
1985). To facilitate the investigation of
algorithm-architecture interactions and the evaluation
of software tools, a reconfigurable hardware test-bed is
being assembled. This paper discusses the requirements
driving the design of the parallel processing
test-bed and describes the test-bed architecture being 
implemented at NASA Lewis. An example of how a typi- 
cal PDE solution algorithm would map on to the archi- 
tecture is presented. 

Parallel Processing Test-Bed Requirements 

In general, the purpose of a parallel processing 
test-bed is to support the development of parallel 
algorithms and the evaluation of software tools. 

Since many of the architectural requirements for a 
particular algorithm or software tool usually are not 
known, the test-bed must provide a degree of flexibil- 
ity in configuration. This suggests some of the 
following desirable capabilities for any parallel pro- 
cessing test-bed. 

(1) Ability to incorporate processors of various 
architectures within the parallel processing configur- 
ation. This allows evaluation of how the architecture 
and performance of the individual processing elements 
within a parallel system architecture can affect over- 
all performance. Some processor architectural charac- 
teristics to be considered are vector processing 
capability, memory configuration (cache memory, inter- 
leaving), and specialized coprocessors (floating- 
point, graphics). 

(2) Ability to emulate a wide variety of paral- 
lel processing architectures. The impact of inter- 
processor communication overhead is a critical issue 
in parallel processing research. The ability to vary 
the system architecture (and thereby the interpro- 
cessor communication paths) allows investigations into 
architecture-algorithm interactions. 

(3) Ability to modify the I/O structure of the 
parallel processor. Input and/or output processing 
are the dominant time consumers for some applications. 
The ability to modify or augment the I/O structure 
allows research into distributed database techniques
and partitioning of the I/O task. 

(4) Capability to expand to a large scale 
parallel system. This is necessary to evaluate algo- 
rithms requiring a large number of processors for 
effectiveness. 

The usefulness of a parallel processing research
test-bed having the above characteristics was recognized
by researchers involved in IBM's Research Parallel
Processor Project (RP3) (Pfister 1985). However,
the RP3 architecture is neither commercially available 
nor easy to replicate. Commercial versions of some 
parallel processing architectures have recently become 
available. In most cases, the architecture is fixed 
and/or the user has limited capabilities for architectural
or processor modifications. For example,
Alliant's FX/8 machine (Alliant Computer Systems 1985)
has multiple vector processors interconnected through
shared memory. However, the current architecture is
limited to eight processors. The BBN Butterfly
(Crowther et al. 1985) has a large number of scalar 
processors communicating through shared memory, but 
lacks vector processing capability. Flexible Com- 
puters’ FLEX/32 (Manuel 1985) combines both message 
passing and shared memory communication mechanisms, 
but again lacks vector processing capability. 

The n-cube architecture, also known as the hypercube,
is becoming a popular architecture due to its
expandability and capability to emulate other
architectures. A hypercube that has vector processing
capability at each node (two commercial versions of 
which are discussed in Gustafson et al. 1986 and 
Robinson 1985) meets several of the requirements for 
a parallel processing test-bed. The strong points
are: 1) an architecture which is expandable in a
systematic manner, 2) vector and scalar processing
capability at each node, and 3) the ability to emulate 
a limited number of other architectures. Emulation 
of shared memory architectures on the hypercube is 
difficult, however. This is especially true for 
applications exhibiting a fine-grained parallel struc- 
ture, such as linear algebra. The difficulty can be 
traced to the interprocessor communications in the 
hypercube, which exhibit high overhead for two rea- 
sons. First, a routing algorithm is required for all 
but those applications which directly map on to the 
hypercube network. This consumes processor resources 
since the communication path from one processor to 
another must be calculated. Second, most commercial
versions of the hypercube implement the network
interconnections with fixed serial links. The net
throughput rate on these links is relatively low when
packetization and software protocol are taken into
account. In addition, the link connections cannot be
reconfigured. 
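
The path calculation mentioned above can be illustrated with a short sketch. It assumes dimension-ordered (e-cube) routing, a common hypercube scheme that this paper does not itself name; the function name `ecube_route` and the integer node addressing are illustrative assumptions only.

```python
def ecube_route(src: int, dst: int, dim: int) -> list[int]:
    """Return the node sequence from src to dst in a dim-dimensional hypercube.

    Assumed scheme: dimension-ordered (e-cube) routing. Address bits that
    differ between the current node and the destination are corrected from
    the lowest dimension to the highest, one hop per differing bit.
    """
    size = 1 << dim
    if not (0 <= src < size and 0 <= dst < size):
        raise ValueError("node address out of range for this cube dimension")
    path = [src]
    node = src
    for bit in range(dim):
        mask = 1 << bit
        if (node ^ dst) & mask:   # addresses differ in this dimension
            node ^= mask          # cross the link along this dimension
            path.append(node)
    return path
```

For a D-2 cube, `ecube_route(0, 3, 2)` visits nodes 0, 1 and 3. The hop count always equals the number of address bits in which source and destination differ, which is the work a processor must spend on each message that does not map directly onto a cube link.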

Hypercluster Architecture 

A modified version of the hypercube architecture, 
called the hypercluster, is proposed to overcome the 
difficulties described above. The hypercluster 
retains the hypercube network structure between pro- 
cessor nodes, but each node now consists of multiple 
processors communicating through a shared memory. 

This concept is illustrated in Figure 1, for a dimension
2 (D-2) cube. Each circle labelled 'M' represents
a shared memory at a node. Each square labelled
'P' is a processing element interconnected to the
shared memory in some fashion. Processors can have 
local memory in addition to shared memory. Communi- 
cation links between nodes form the hypercluster 
network. 

The hypercluster supports both tightly coupled
interprocessor communication via shared memory (within 
a node) and loosely coupled communication through the 
hypercube network (between nodes). The hypercluster 
is expanded in the same manner as the hypercube, with 
processor clusters replacing the normal single pro- 
cessor node. An arbitrary number of processors may 
be assigned to a cluster, limited only by the hardware 
constraints of the shared memory interconnect and/or 
power requirements. 
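
The two communication levels described above can be modelled in a few lines. This is a minimal sketch, not the NASA Lewis executive software: the `Processor` record, the `(node, local)` addressing and the function names are hypothetical, chosen only to show which processor pairs share memory and which must use the cube network.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Processor:
    node: int   # address of the hypercube node (cluster) holding the processor
    local: int  # index of the processor within that cluster (hypothetical)


def node_neighbors(node: int, dim: int) -> list[int]:
    """Nodes directly linked to `node` in a dim-dimensional cube."""
    return [node ^ (1 << bit) for bit in range(dim)]


def link_type(a: Processor, b: Processor) -> tuple[str, int]:
    """Classify the path between two processors.

    Returns ("shared-memory", 0) for processors in the same cluster,
    otherwise ("network", hops), where hops is the number of cube links
    on a shortest path between the two nodes.
    """
    if a.node == b.node:
        return ("shared-memory", 0)
    hops = bin(a.node ^ b.node).count("1")
    return ("network", hops)
```

In a D-2 hypercluster, node 0 is linked to nodes 1 and 2; two processors on node 0 communicate through shared memory, while a processor on node 0 and one on node 3 are two network hops apart.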

Figure 2 shows a more detailed diagram of the D-2 
hypercluster configuration being implemented at the 
NASA Lewis Research Center. The nodes consist of 
multiple board-level computers interconnected by a 
commercial bus. Although a bus is not the highest 
performance shared-memory interconnect mechanism 
available, it does allow for convenient implementa- 
tion. In addition, the use of a commercial bus allows 
a variety of processor architectures to be incorpo- 
rated within a node. Thus each node has an architec- 
tural ’personality’ determined by the type of 
processor boards connected to the bus. The NASA Lewis 
D-2 hypercluster has three nodes with a vector per- 
sonality and one node with a scalar personality. Each 
of the vector nodes uses four board-level vector pro- 
cessors, while the scalar node uses four general pur- 
pose microcomputer boards. The incorporation of 
vector processors is crucial in the investigation of
CFM/CSM algorithms because many CFM/CSM algorithms 
contain large arrays of independent computations that 
are best handled by a vector architecture. The avail- 
ability of multiple vector processors allows very 
large arrays of calculations to be broken up and dis- 
tributed for a parallel solution. 

There are two types of communication links. 
Internode communication links form the hypercluster 
network as described before. Additional links provide 
communication paths between each node and a front-end 
processor (FEP). The FEP allows a user to interact 
with the hypercluster. Each communication link con- 
sists of two control processors (CP) interconnected 
by a dual-port memory. The CP's coordinate communi- 
cation over the links and supervise the operation of 
processors within a node. Executive software in each 
CP performs these functions. For the D-2 hyper- 
cluster, it is both practical and advantageous to have 
a communication link between each node and the FEP. 
However, as the hypercluster is expanded to more 
nodes, the associated size and cost constraints make 
this approach impractical. In that case, most nodes 
will not have an FEP link. Software will then be 
necessary to route information from nodes with an FEP 
link to those without. 

Shared memory within a node consists of memory 
boards connected to the node's bus, and/or dual-ported 
memory on the processor boards. Dual-ported memory 
has become a standard feature on many commercial com- 
puter boards, and is particularly useful in the hyper- 
cluster environment. Through software, memory seg- 
ments can be allocated as local to a processor, global 
to all processors in a node, or a combination of local 
and global segments. This allows emulation of the 
different memory hierarchies used in parallel pro- 
cessing systems. 

Each node of the hypercluster can have its own 
local I/O capability. For example, each node can have 
a disk control processor and hard disk drive. This
arrangement would allow research in distributed I/O 
and database techniques, aimed at eliminating the I/O 
bottleneck present in many applications. 

An Algorithm Example 

The alternating direction implicit (ADI) algo- 
rithm is a technique commonly used for the solution 
of partial differential equations (PDE) (Gerald 1980). 
The two stages of the ADI algorithm are shown in 
Figure 3 for a 4 by 4 grid. 

In the first stage, equations are formed which 
are implicit (i.e., depend on current time step infor- 
mation) in the X direction only. Thus a coefficient 
matrix A and vector b can be generated to form the 
system Ax = b which describes one row of points. 
Several such systems are formed to describe the entire 
grid. The matrix A is a tridiagonal matrix (the 
matrix is block tridiagonal if several PDE’s are 
solved at each grid point). The second stage of the 
ADI algorithm begins after the equations from the 
first stage are solved. It is identical to the first 
stage except that now the equations are implicit in 
the Y direction only. 
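
Each tridiagonal system Ax = b of this kind can be solved in a number of operations proportional to the number of unknowns. The sketch below uses the serial Thomas algorithm as a baseline; the paper does not name a solver, and the parallel solution it has in mind within a node would use a different method. The function name and the three-array storage of A are assumptions made for illustration.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm.

    All four sequences have length n. lower[i] multiplies x[i-1] and
    upper[i] multiplies x[i+1] in equation i; lower[0] and upper[n-1]
    are ignored. No pivoting is done, so the matrix is assumed to be
    diagonally dominant (typical of implicit PDE discretizations).
    """
    n = len(diag)
    c = [0.0] * n   # modified super-diagonal
    d = [0.0] * n   # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):                     # forward elimination
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):            # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

For one row of a 4 by 4 grid the system has four unknowns, for example `solve_tridiagonal([0, -1, -1, -1], [2, 2, 2, 2], [-1, -1, -1, 0], [0, 0, 0, 5])`. The forward elimination loop carries a dependence from one equation to the next, which is why a single row does not vectorize or parallelize directly with this method.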

Each system of equations for a row or column is 
independent (in the current time step) of information 
from neighboring rows or columns. Only information
from past time steps for neighboring rows or columns
is used. This characteristic of the ADI algorithm
makes it particularly attractive for solution on a 
parallel processor. The solution of rows or columns 
can be done in parallel. Each row or column can be 
allocated to a processor, if sufficient processors are 
available. Otherwise, groups of rows or columns must 
be formed, where the number of groups would equal the 
number of available processors. 
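
The grouping rule can be sketched as follows. Contiguous, nearly equal groups are an assumption made here for load balance; the paper only requires that the number of groups equal the number of available processors. The function name `partition_rows` is hypothetical.

```python
def partition_rows(n_rows: int, n_procs: int) -> list[range]:
    """Split row indices 0..n_rows-1 into contiguous groups, one per processor.

    If there are at least as many processors as rows, each row gets its
    own processor. Otherwise n_procs groups are formed whose sizes differ
    by at most one row.
    """
    if n_rows < 0 or n_procs < 1:
        raise ValueError("need n_rows >= 0 and n_procs >= 1")
    n_groups = min(n_rows, n_procs)
    if n_groups == 0:
        return []
    base, extra = divmod(n_rows, n_groups)
    groups = []
    start = 0
    for g in range(n_groups):
        size = base + (1 if g < extra else 0)  # first `extra` groups get one more row
        groups.append(range(start, start + size))
        start += size
    return groups
```

With 10 rows and 4 processors this gives groups of 3, 3, 2 and 2 rows. The same function applies unchanged to columns in the second stage.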

The first stage of the ADI algorithm would map 
onto the hypercluster as shown in Figure 4. For the 
simple 4 by 4 grid and D-2 hypercluster shown, each 
row would be solved on a hypercluster node. If the 
grid were larger, groups of rows would be assigned to 
the node, or more nodes could be added. The processor 
allocation described thus far could be accomplished 
on any hypercube implementation. The advantage of the 
hypercluster architecture for the ADI algorithm is the 
ability to apply the tightly coupled multiple proces- 
sors within each node to the simultaneous solution of 
the equation systems. The allocation of rows to 
hypercluster nodes results in one or more block tri- 
diagonal equation systems which must be solved at each 
node. After the parallel solution of the equation 
sets is completed, information to and from neighbor- 
ing rows is transmitted between nodes via the hyper- 
cube network. Then the second stage of the ADI 
algorithm can proceed with columns allocated to hyper- 
cluster nodes. After solution of both stages, the 
results are checked for convergence. If convergence 
has not been achieved, the whole process is repeated. 
The parallel ADI algorithm is outlined by the pseudocode
in Figure 5.
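
The loop structure just described can be made concrete with a small runnable sketch. Several choices here are assumptions that go beyond the paper: the model problem is Laplace's equation on a square grid with fixed boundary values, the iteration is the Peaceman-Rachford form of ADI with a single parameter `rho`, and the parallel loop over rows is imitated with a Python thread pool, which shows the structure but gives no real speedup. The step that transfers data between nodes has no counterpart because all threads share one address space.

```python
from concurrent.futures import ThreadPoolExecutor


def solve_line(rhs, rho):
    """Thomas solve of (rho + 2) x[i] - x[i-1] - x[i+1] = rhs[i]."""
    n = len(rhs)
    diag = rho + 2.0
    c = [0.0] * n
    d = [0.0] * n
    c[0] = -1.0 / diag
    d[0] = rhs[0] / diag
    for i in range(1, n):
        denom = diag + c[i - 1]
        c[i] = -1.0 / denom
        d[i] = (rhs[i] + d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def sweep_rows(u, rho, pool):
    """One implicit sweep along every interior row of the square grid u."""
    n = len(u) - 2

    def one_row(j):
        # Neighboring rows enter only through values from the previous sweep.
        rhs = [(rho - 2.0) * u[j][i] + u[j - 1][i] + u[j + 1][i]
               for i in range(1, n + 1)]
        rhs[0] += u[j][0]        # known boundary value at the left end
        rhs[-1] += u[j][n + 1]   # known boundary value at the right end
        return solve_line(rhs, rho)

    rows = list(pool.map(one_row, range(1, n + 1)))  # rows are independent
    new = [row[:] for row in u]
    for j, row in enumerate(rows, start=1):
        new[j][1:n + 1] = row
    return new


def transpose(u):
    return [list(col) for col in zip(*u)]


def adi_solve(u, rho=1.0, tol=1e-10, max_iter=500, workers=4):
    """Repeat both ADI stages until the largest change falls below tol."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for iteration in range(1, max_iter + 1):
            half = sweep_rows(u, rho, pool)                          # stage 1: implicit in X
            new = transpose(sweep_rows(transpose(half), rho, pool))  # stage 2: implicit in Y
            change = max(abs(a - b)
                         for ra, rb in zip(new, u) for a, b in zip(ra, rb))
            u = new
            if change < tol:
                return u, iteration
    raise RuntimeError("ADI did not converge")
```

The second stage reuses the row sweep on the transposed grid, so a column sweep is the same computation with the roles of X and Y exchanged. On a hypercluster, each call to `one_row` would correspond to work assigned to a node, and the transpose would correspond to the internode data transfer between stages.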

The ADI algorithm is only one of many algorithms 
available to solve a system of partial differential 
equations. The partitioning of the calculations for 
the ADI algorithm described above is one of many pos- 
sible methods. This example has been given only to 
demonstrate the usefulness of the hypercluster archi- 
tecture for implementing a particular algorithm. 

A number of parallel processing algorithms for 
solving partial differential equations have been pro- 
posed in Hockney and Jesshope 1981. Some of these 
algorithms are also vectorizable. Future work using 
the hypercluster as a test-bed will attempt to deter- 
mine which algorithms (combined with the appropriate 
architecture) are optimum for CFM/CSM applications.

CONCLUDING REMARKS 

The hypercluster architecture is intended to pro- 
vide a reconfigurable test-bed on which various paral- 
lel processing algorithms, programming and operating 
tools can be developed. There is still a considerable 
amount of uncertainty as to the optimum parallel pro- 
cessing architecture for specific applications such 
as CFM and CSM. There is also a definite lack of pro- 
gramming and operating software that will allow 
researchers to easily take advantage of parallel pro- 
cessing. Future work using the hypercluster test-bed 
will attempt to address some of these issues. It will 
allow CFM/CSM research at the NASA Lewis Research 
Center to readily adapt to the rapidly developing 
discipline of parallel processing. 

FIGURE 3. - SWEEP PATTERN FOR ALTERNATING DIRECTION
IMPLICIT METHOD. (Panel labels: X DIRECTION; STAGE 2.)

REPEAT
  FOR ROW = 1 TO NROWS DO IN PARALLEL
  BEGIN
    CALCULATE COEFFICIENTS OF MATRIX A, VECTOR b
    SOLVE A x(k+1) = b VIA PARALLEL ALGORITHM
  END
  TRANSFER DATA TO NEIGHBORING NODES
  FOR COLUMN = 1 TO NCOLUMNS DO IN PARALLEL
  BEGIN
    CALCULATE COEFFICIENTS OF MATRIX A, VECTOR b
    SOLVE A x(k+2) = b VIA PARALLEL ALGORITHM
  END
  TRANSFER DATA TO NEIGHBORING NODES
UNTIL Δx < ε

FIGURE 5. - PSEUDOCODE FOR PARALLEL ADI ALGORITHM. 